Regression - Dozer Price Prediction
This tutorial predicts bulldozer auction sale prices using the Bluebook for Bulldozers dataset.
- Dataset Source: Kaggle Bluebook for Bulldozers
- Problem Type: Regression
- Target Variable: SalePrice - The sale price of the bulldozer at auction
- Use Case: Price prediction for heavy equipment, identifying arbitrage opportunities in equipment sales
Package Imports
Xplainable Cloud Setup
Data Loading and Exploration
Load the Bluebook for Bulldozers dataset
Download the Bluebook for Bulldozers dataset from Kaggle: https://www.kaggle.com/c/bluebook-for-bulldozers/data
After extracting the .zip file, build the dataset as follows:
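A minimal sketch of the loading step, assuming pandas is installed. The helper name `load_sales` and the file name in the usage comment are illustrative assumptions, not taken from the original notebook; only the `saledate` column name comes from the data shown here.

```python
import pandas as pd


def load_sales(csv_path: str) -> pd.DataFrame:
    """Read a sales CSV and parse the sale date column (hypothetical helper)."""
    # low_memory=False reads the file in one pass, which avoids mixed-dtype
    # guesses on wide, sparsely populated columns.
    return pd.read_csv(csv_path, parse_dates=["saledate"], low_memory=False)


# Usage (the path is an assumption; point it at the extracted file):
# df = load_sales("Train.csv")
# df.head()
```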
| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3 | 2004 | 68 | Low | 2006-11-16 | ... | nan | nan | nan | nan | nan | nan | nan | nan | Standard | Conventional |
| 1 | 1139248 | 57000 | 117657 | 77 | 121 | 3 | 1996 | 4640 | Low | 2004-03-26 | ... | nan | nan | nan | nan | nan | nan | nan | nan | Standard | Conventional |
| 2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3 | 2001 | 2838 | High | 2004-02-26 | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3 | 2001 | 3486 | High | 2011-05-19 | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3 | 2007 | 722 | Medium | 2009-07-23 | ... | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
Load the machine appendix, which holds additional information about the dozer assets
| | MachineID | ModelID | fiModelDesc | fiBaseModel | fiSecondaryDesc | fiModelSeries | fiModelDescriptor | fiProductClassDesc | ProductGroup | ProductGroupDesc | MfgYear | fiManufacturerID | fiManufacturerDesc | PrimarySizeBasis | PrimaryLower | PrimaryUpper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 113 | 1355 | 350L | 350 | nan | nan | L | Hydraulic Excavator, Track - 50.0 to 66.0 Metr... | TEX | Track Excavators | 1994 | 26 | Caterpillar | Weight - Metric Tons | 50 | 66 |
| 1 | 434 | 3538 | 416C | 416 | C | nan | nan | Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... | BL | Backhoe Loaders | 1997 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 2 | 534 | 3538 | 416C | 416 | C | nan | nan | Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... | BL | Backhoe Loaders | 1998 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 3 | 718 | 3538 | 416C | 416 | C | nan | nan | Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... | BL | Backhoe Loaders | 2000 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 4 | 1753 | 1580 | D5GLGP | D5 | G | nan | LGP | Track Type Tractor, Dozer - 85.0 to 105.0 Hors... | TTT | Track Type Tractors | 2006 | 26 | Caterpillar | Horsepower | 85 | 105 |
Merge the datasets on MachineID to extract useful information:
- Find the columns that exist within the machine dictionary that aren't in the training dataset
- Merge the new columns on the existing train dataset to enrich the information
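The two steps above can be sketched as follows, assuming pandas. The function name `enrich_with_appendix` and the tiny frames are illustrative assumptions; the choice of a left join (so that no sale is lost when a machine is missing from the appendix) is also an assumption rather than something the original states.

```python
import pandas as pd


def enrich_with_appendix(train: pd.DataFrame, appendix: pd.DataFrame,
                         key: str = "MachineID") -> pd.DataFrame:
    """Add appendix columns that the training data does not already have."""
    # Step 1: columns that exist in the appendix but not in the training data
    new_cols = [col for col in appendix.columns if col not in train.columns]
    # Step 2: left join on the key so every sale row is kept
    return train.merge(appendix[[key] + new_cols], on=key, how="left")


# Toy data for illustration only
train = pd.DataFrame({"SalesID": [1, 2],
                      "MachineID": [113, 434],
                      "ModelID": [1355, 3538]})
appendix = pd.DataFrame({"MachineID": [113, 434],
                         "ModelID": [1355, 3538],
                         "MfgYear": [1994, 1997],
                         "fiManufacturerDesc": ["Caterpillar", "Caterpillar"]})

enriched = enrich_with_appendix(train, appendix)
print(enriched.columns.tolist())
```

If the appendix can contain more than one row per MachineID, deduplicating it before the merge avoids multiplying sale rows.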
| | MfgYear | fiManufacturerID | fiManufacturerDesc | PrimarySizeBasis | PrimaryLower | PrimaryUpper |
|---|---|---|---|---|---|---|
| 0 | 1994 | 26 | Caterpillar | Weight - Metric Tons | 50 | 66 |
| 1 | 1997 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 2 | 1998 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 3 | 2000 | 26 | Caterpillar | Standard Digging Depth - Ft | 14 | 15 |
| 4 | 2006 | 26 | Caterpillar | Horsepower | 85 | 105 |
1. Data Preprocessing
Feature Engineering and Data Preparation
Preprocessor Persistence
Save the preprocessing pipeline spec to Xplainable Cloud for reproducibility.
| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | SteeringControls | MfgYear | fiManufacturerID | fiManufacturerDesc | PrimarySizeBasis | PrimaryLower | PrimaryUpper | saleyear | salemonth | saledayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000.0 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 521D | ... | Conventional | 2004.0 | 25 | Case | Horsepower | 110.0 | 120.0 | 2006 | 11 | Thursday |
| 1 | 1139248 | 57000.0 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 950FII | ... | Conventional | 1996.0 | 26 | Caterpillar | Horsepower | 150.0 | 175.0 | 2004 | 3 | Friday |
| 2 | 1139249 | 10000.0 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 226 | ... | nan | 2001.0 | 26 | Caterpillar | Operating Capacity - Lbs | 1351.0 | 1601.0 | 2004 | 2 | Thursday |
| 3 | 1139251 | 38500.0 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | PC120-6E | ... | nan | 2010.0 | 103 | Komatsu | Horsepower | 225.0 | 250.0 | 2011 | 5 | Thursday |
| 4 | 1139253 | 11000.0 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | S175 | ... | nan | 2007.0 | 121 | Bobcat | Operating Capacity - Lbs | 1601.0 | 1751.0 | 2009 | 7 | Thursday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 412693 | 6333344 | 10000.0 | 1919201 | 21435 | 149 | 2.0 | 2005 | nan | nan | 30NX | ... | nan | 2005.0 | 2552 | IHI | Weight - Metric Tons | 2.0 | 3.0 | 2012 | 3 | Wednesday |
| 412694 | 6333345 | 10500.0 | 1882122 | 21436 | 149 | 2.0 | 2005 | nan | nan | 30NX2 | ... | nan | 2005.0 | 2552 | IHI | Weight - Metric Tons | 3.0 | 4.0 | 2012 | 1 | Saturday |
| 412695 | 6333347 | 12500.0 | 1944213 | 21435 | 149 | 2.0 | 2005 | nan | nan | 30NX | ... | nan | 2005.0 | 2552 | IHI | Weight - Metric Tons | 2.0 | 3.0 | 2012 | 1 | Saturday |
| 412696 | 6333348 | 10000.0 | 1794518 | 21435 | 149 | 2.0 | 2006 | nan | nan | 30NX | ... | nan | 2006.0 | 2552 | IHI | Weight - Metric Tons | 2.0 | 3.0 | 2012 | 3 | Wednesday |
| 412697 | 6333349 | 13000.0 | 1944743 | 21436 | 149 | 2.0 | 2006 | nan | nan | 30NX2 | ... | nan | 2005.0 | 2552 | IHI | Weight - Metric Tons | 2.0 | 3.0 | 2012 | 1 | Saturday |
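The saleyear, salemonth and saledayofweek columns in the table above can be derived from saledate. The following is a minimal pandas sketch of that derivation, not the original preprocessing pipeline; the helper name is hypothetical, and dropping saledate afterwards is an assumption based on the column no longer appearing in the output.

```python
import pandas as pd


def add_sale_date_features(df: pd.DataFrame, date_col: str = "saledate") -> pd.DataFrame:
    """Split the sale date into year, month and weekday-name columns."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["saleyear"] = dates.dt.year
    out["salemonth"] = dates.dt.month
    out["saledayofweek"] = dates.dt.day_name()  # e.g. "Thursday"
    # Assumption: the raw date is no longer needed once the parts are extracted
    return out.drop(columns=[date_col])


# Toy row for illustration
sample = pd.DataFrame({"SalesID": [1139246], "saledate": ["2006-11-16"]})
features = add_sale_date_features(sample)
```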
Train on the top 6 dozer assets by count
To keep training time manageable, filter the data to the top 6 assets by count
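A minimal pandas sketch of this filter. Grouping by `ModelID` is an assumption (the text only says "assets"), and `filter_top_n` is a hypothetical helper name.

```python
import pandas as pd


def filter_top_n(df: pd.DataFrame, column: str = "ModelID", n: int = 6) -> pd.DataFrame:
    """Keep only the rows whose value in `column` is among the n most frequent."""
    # value_counts() sorts by frequency in descending order
    top_values = df[column].value_counts().head(n).index
    return df[df[column].isin(top_values)]


# Toy data: with n=2 only the two most frequent ModelIDs survive
toy = pd.DataFrame({"ModelID": [4605, 4605, 4605, 4604, 4604, 3171],
                    "SalePrice": [26500, 24000, 33000, 19000, 20500, 23000]})
top_two = filter_top_n(toy, n=2)
```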
| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | SteeringControls | MfgYear | fiManufacturerID | fiManufacturerDesc | PrimarySizeBasis | PrimaryLower | PrimaryUpper | saleyear | salemonth | saledayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1139255 | 26500.0 | 1001274 | 4605 | 121 | 3.0 | 2004 | 508.0 | Low | 310G | ... | nan | 2004.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2008 | 12 | Thursday |
| 10 | 1139278 | 24000.0 | 1024998 | 4605 | 121 | 3.0 | 2004 | 1414.0 | Medium | 310G | ... | nan | 2004.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2008 | 8 | Thursday |
| 15 | 1139291 | 19000.0 | 1004810 | 4604 | 121 | 3.0 | 1999 | 2450.0 | Medium | 310E | ... | nan | 1999.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2006 | 11 | Thursday |
| 62 | 1139469 | 23000.0 | 1058869 | 3171 | 121 | 3.0 | 1998 | 9987.0 | High | 580L | ... | nan | 1998.0 | 25 | Case | Standard Digging Depth - Ft | 14.0 | 15.0 | 2007 | 5 | Thursday |
| 82 | 1139515 | 33000.0 | 1015565 | 4605 | 121 | 3.0 | 2002 | 1268.0 | Medium | 310G | ... | nan | 2002.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2004 | 7 | Thursday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 410243 | 6288239 | 18200.0 | 1835461 | 4604 | 149 | 99.0 | 2000 | 48.0 | Low | 310E | ... | nan | 2000.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2012 | 2 | Wednesday |
| 410244 | 6288240 | 25250.0 | 1903914 | 4605 | 149 | 0.0 | 2005 | 1988.0 | Low | 310G | ... | nan | 2005.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2012 | 1 | Saturday |
| 410245 | 6288241 | 25250.0 | 1860549 | 4605 | 149 | 99.0 | 2006 | nan | nan | 310G | ... | nan | 2006.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2012 | 4 | Wednesday |
| 410246 | 6288243 | 25000.0 | 1846184 | 4605 | 149 | 1.0 | 2006 | nan | nan | 310G | ... | nan | 2006.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2012 | 3 | Thursday |
| 410264 | 6288346 | 20500.0 | 1867087 | 4604 | 149 | 4.0 | 2000 | nan | nan | 310E | ... | nan | 2000.0 | 43 | John Deere | Standard Digging Depth - Ft | 14.0 | 15.0 | 2012 | 2 | Monday |
| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | TireSize | MfgYear | fiManufacturerID | fiManufacturerDesc | PrimarySizeBasis | PrimaryLower | PrimaryUpper | saleyear | salemonth | saledayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1.13926e+06 | 26500 | 1.00127e+06 | 4605 | 121 | 3 | 2004 | 508 | Low | 310G | ... | nan | 2004 | 43 | John Deere | Standard Digging Depth - Ft | 14 | 15 | 2008 | 12 | Thursday |
| 10 | 1.13928e+06 | 24000 | 1.025e+06 | 4605 | 121 | 3 | 2004 | 1414 | Medium | 310G | ... | nan | 2004 | 43 | John Deere | Standard Digging Depth - Ft | 14 | 15 | 2008 | 8 | Thursday |
| 15 | 1.13929e+06 | 19000 | 1.00481e+06 | 4604 | 121 | 3 | 1999 | 2450 | Medium | 310E | ... | nan | 1999 | 43 | John Deere | Standard Digging Depth - Ft | 14 | 15 | 2006 | 11 | Thursday |
| 62 | 1.13947e+06 | 23000 | 1.05887e+06 | 3171 | 121 | 3 | 1998 | 9987 | High | 580L | ... | nan | 1998 | 25 | Case | Standard Digging Depth - Ft | 14 | 15 | 2007 | 5 | Thursday |
| 82 | 1.13952e+06 | 33000 | 1.01556e+06 | 4605 | 121 | 3 | 2002 | 1268 | Medium | 310G | ... | nan | 2002 | 43 | John Deere | Standard Digging Depth - Ft | 14 | 15 | 2004 | 7 | Thursday |
Addressing Multicollinearity in Model Interpretability
It is well understood in data science that multicollinearity can significantly hamper the interpretability of models, particularly those based on linear assumptions. The code snippet above demonstrates a rudimentary approach to mitigating multicollinearity by removing highly correlated features. This is a simplified illustration; in practice, the interplay between features can be more subtle and complex.
For robust feature selection and better model explainability, we employ automated feature selection techniques that are covered in our project's documentation. These methods go beyond pairwise correlations, considering the multidimensional structure of the data to retain the most informative features. While the current example is not exhaustive, it highlights a fundamental preprocessing step for linear models. Practitioners are encouraged to use our automatic feature selection capabilities to refine their models further and to ensure that the explanatory variables reflect independent factors influencing the response variable.
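As an illustration of the pairwise approach described here, the sketch below drops one feature from each highly correlated pair. It assumes pandas and NumPy; the 0.9 threshold, the helper name `drop_correlated` and the toy values are assumptions, and this is not the automated feature selection mentioned above.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every numeric pair with |correlation| > threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep the strict upper triangle so each pair is inspected exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: MfgYear duplicates YearMade, hours are only weakly related
toy = pd.DataFrame({"YearMade": [2004, 1996, 2001, 2001, 2007],
                    "MfgYear": [2004, 1996, 2001, 2001, 2007],
                    "MachineHoursCurrentMeter": [100, 200, 150, 400, 300]})
reduced, dropped = drop_correlated(toy)
```

Because the first column of each pair is kept, the column order decides which feature survives; reorder the frame if a particular feature should be preferred.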
Split the train and validation set
2. Model Training
Initial Model Training
3. Model Optimization
Evolutionary Network Optimization
By fitting a combination of 6 Tighten and Evolution layers, we decreased the MAE by approximately 90. Experiment with more layers to see whether better results are possible.
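For reference, the MAE used for this comparison is the average absolute difference between actual and predicted prices. The sketch below computes it in plain Python for two sets of predictions; the numbers are made up for illustration and are not results from this notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices and predictions, before and after optimisation
actual = [26500, 24000, 19000, 23000]
baseline_pred = [25000, 26000, 20000, 21000]
optimised_pred = [26000, 25000, 19500, 22000]

baseline_mae = mean_absolute_error(actual, baseline_pred)
optimised_mae = mean_absolute_error(actual, optimised_pred)
improvement = baseline_mae - optimised_mae
```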
Comparing against the validation set
4. Model Interpretability and Explainability
Model Feature Importance Analysis
Explaining the variance in the Error Plot
Before examining the detailed error plot, it is essential to consider the real-world operational differences among bulldozer models, as well as the insights provided by subject matter experts (SMEs). These differences are likely to appear as distinct groupings in the predicted versus actual results. Each model type has its own characteristics, such as age, usage, and maintenance history, that could create these groups, affecting sale prices and therefore prediction accuracy. Recognizing these potential variances prepares us to understand and address the disparities in predictive performance across different Model IDs that the following plot reveals.
Insights from Scatter Plot Analysis
The scatter plot displayed above demonstrates a significant variation in the predictive accuracy across different Model IDs, as indicated by the spread of points in relation to the black dashed line, which represents perfect prediction. Models such as those in the yellow cluster are closely aligned with the line, suggesting higher prediction accuracy for these Model IDs. This observation underscores the importance of partitioning the dataset to develop model-specific predictive algorithms. By doing so, we can account for the unique characteristics of each model, which may include factors specific to the model that affect the score contributions.
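The per-Model-ID differences described here can also be quantified numerically. The following plain-Python sketch computes MAE separately for each group; the helper name and the sample values are hypothetical.

```python
from collections import defaultdict


def mae_by_group(groups, y_true, y_pred):
    """Return {group: mean absolute error} for aligned sequences."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of abs errors, count]
    for group, actual, predicted in zip(groups, y_true, y_pred):
        totals[group][0] += abs(actual - predicted)
        totals[group][1] += 1
    return {group: err_sum / count for group, (err_sum, count) in totals.items()}


# Hypothetical validation rows
model_ids = [4605, 4605, 4604, 3171]
actual = [26500, 24000, 19000, 23000]
predicted = [26000, 25000, 19500, 20000]

errors = mae_by_group(model_ids, actual, predicted)
```

Groups with a much larger error than the rest are natural candidates for a dedicated model.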
5. Model Persistence
Save Model to Xplainable Cloud
Step 1: Instantiate the Client
Connect to the Xplainable API using your provided API key and the local hostname. This allows further interaction with the platform for model creation and deployment.
Step 2: Create a Model
Define and create a machine learning model on the Xplainable platform. This includes setting a name, description, and providing training features (X_train) and targets (y_train).
6. Model Deployment
Deploy Model for Inference
Step 4: Activate the Deployment
Activate the model deployment so that it’s ready to receive inference requests.
Step 5: Generate a Deploy Key
Generate an API deploy key for secure access to the deployed model. This key will be used to authenticate when making prediction requests.
Step 6: Format a Sample Input
Prepare a single test sample (excluding the target column SalePrice) to be used for model inference. This sample is converted to JSON format for use in an API call.
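A minimal sketch of this formatting step using only the standard library. Wrapping the sample in a list (a "records" layout) is an assumption about what the endpoint expects, and `format_sample` is a hypothetical helper. Missing values are converted to `null`, because `NaN` is not valid JSON.

```python
import json
import math


def format_sample(row: dict, target: str = "SalePrice") -> str:
    """Serialise one sample to JSON, leaving out the target column."""
    features = {}
    for name, value in row.items():
        if name == target:
            continue  # the model must not see the value it is asked to predict
        # Replace float NaN with None so the output is strictly valid JSON
        is_nan = isinstance(value, float) and math.isnan(value)
        features[name] = None if is_nan else value
    return json.dumps([features])


# Toy sample for illustration
sample_row = {"ModelID": 4605, "YearMade": 2004,
              "MachineHoursCurrentMeter": float("nan"), "SalePrice": 26500.0}
payload = format_sample(sample_row)
```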
Step 7: Send a Prediction Request
Make a POST request to the Xplainable inference endpoint with the sample input. The deploy_key is included in the headers for authentication, and the model returns a prediction based on the JSON-formatted input data.
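The request can be sketched with the standard library as below. The URL is a placeholder, and the header name `api_key` is an assumption; check the Xplainable documentation for the actual inference endpoint and authentication header. The sketch only builds the request; the commented lines show how it could be sent.

```python
import json
import urllib.request


def build_prediction_request(url: str, deploy_key: str, records: list) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the deploy key."""
    body = json.dumps(records).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "api_key": deploy_key,  # assumed header name
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder endpoint and key, for illustration only
request = build_prediction_request(
    "https://inference.example.com/v1/predict",
    "my-deploy-key",
    [{"ModelID": 4605, "YearMade": 2004}],
)

# To send it:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```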
7. Partitioned Models
Enhanced Model Performance with Partitioning
The Power of Partitioned Models in Price Prediction
When predicting prices for heavy equipment like in the Bluebook Dozer Price
Prediction challenge, one-size-fits-all models often fall short. Different equipment
models (ModelIDs) can have vastly different characteristics—age, usage patterns,
depreciation curves, and market dynamics. Trying to capture all of that in a single
global model can dilute performance.
What is a Partitioned Model?
A partitioned model means training separate models for each subgroup or
partition in the data—in this case, for each unique ModelID. Instead of fitting one
global model to the entire dataset, you're allowing the model to specialize based on
contextual differences.
In Xplainable, this can be achieved seamlessly by training per-group models through the auto-training UI or the client.
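To make the idea concrete, here is a deliberately simple plain-Python sketch of partitioning: one "model" per ModelID with a global fallback for unseen partitions. Each per-partition model is just the mean sale price, which stands in for a real regressor. This is not Xplainable's implementation, and the class name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


class PartitionedMeanModel:
    """One mean-price model per partition, plus a global fallback."""

    def __init__(self, partition_key="ModelID", target="SalePrice"):
        self.partition_key = partition_key
        self.target = target
        self.partition_means_ = {}
        self.global_mean_ = None

    def fit(self, rows):
        grouped = defaultdict(list)
        for row in rows:
            grouped[row[self.partition_key]].append(row[self.target])
        # "Train" one model per partition (here: the partition's mean price)
        self.partition_means_ = {key: mean(prices) for key, prices in grouped.items()}
        # The global model is used when a partition was never seen in training
        self.global_mean_ = mean(row[self.target] for row in rows)
        return self

    def predict(self, row):
        return self.partition_means_.get(row[self.partition_key], self.global_mean_)


# Toy training rows
rows = [{"ModelID": 4605, "SalePrice": 26500},
        {"ModelID": 4605, "SalePrice": 24000},
        {"ModelID": 4604, "SalePrice": 19000}]
model = PartitionedMeanModel().fit(rows)
known = model.predict({"ModelID": 4605})
unseen = model.predict({"ModelID": 9999})
```

The fallback matters in practice: partitions with few rows, or ModelIDs that only appear at inference time, still need a prediction.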
Evaluation of Model Predictions Against Validation Data
The scatter plot illustrates our model's performance on the validation set, comparing the true values against the predicted values for various bulldozer models. While the trend line shows that the predictions are generally aligned with the true values, there is an observable underprediction across the data points, and the mean absolute error (MAE) is higher on the validation set (3599) than on the training set (3212).
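MAE measures the size of the errors but not their direction. One way to check for systematic underprediction is the mean signed error, sketched below in plain Python with made-up numbers; a negative value means the predictions are, on average, below the true prices.

```python
def mean_signed_error(y_true, y_pred):
    """Average of (predicted - actual); negative values indicate underprediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical validation prices and predictions
actual = [26500, 24000, 19000]
predicted = [25000, 23000, 18500]
bias = mean_signed_error(actual, predicted)
```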
Considerations for Model Refinement:
- The impact of the mining boom in Australia in 2012, referenced from the Reserve Bank of Australia's report, suggests an economic context that may influence equipment prices. Incorporating macroeconomic indicators could potentially enhance the model's predictive accuracy.
- Introducing time series features that capture year-over-year changes could offer a more nuanced understanding of price fluctuations over time, rather than relying solely on 'Age at Sale', which may not fully encapsulate such trends.
These considerations point towards the inclusion of external economic factors and more sophisticated time-based features to improve the model's prediction capabilities. Further analysis and iterative model tuning will be required to reduce the prediction error and align the model outputs more closely with the validation data.
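One possible form of such a time-based feature is the relative change in median sale price from one sale year to the next. The plain-Python sketch below is an illustration under that assumption; the helper name and the sample rows are hypothetical, and the column names follow the tables above.

```python
from collections import defaultdict
from statistics import median


def yearly_median_change(rows, year_key="saleyear", price_key="SalePrice"):
    """Return {year: relative change in median price versus the previous year in the data}."""
    prices_by_year = defaultdict(list)
    for row in rows:
        prices_by_year[row[year_key]].append(row[price_key])

    changes = {}
    previous_median = None
    # Iterate in chronological order; the first year has no predecessor
    for year in sorted(prices_by_year):
        current_median = median(prices_by_year[year])
        if previous_median is None:
            changes[year] = None
        else:
            changes[year] = (current_median - previous_median) / previous_median
        previous_median = current_median
    return changes


# Toy rows for illustration
rows = [{"saleyear": 2010, "SalePrice": 20000},
        {"saleyear": 2010, "SalePrice": 22000},
        {"saleyear": 2011, "SalePrice": 23100}]
changes = yearly_median_change(rows)
```

The resulting value can be joined back onto each row by sale year. To avoid leakage, the medians should be computed on training data only.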
Further Investigation:
- An analysis of the trend line derived from time series splits (Age at Sale) could reveal insights into future forecasting capabilities. By extending this trend line, we can project forward forecasts that anticipate equipment prices. This approach could be particularly beneficial for capturing the trajectory of market shifts influenced by macroeconomic trends, such as the mining boom.
If you are interested in contributing to the development of this predictive feature or investigating it further, please open an issue on our repository or contact us directly.
Access model partitions and plot explanations
Step 3: Deploy the Model
Deploy the model to make it available for inference. You’ll use the version ID returned from the model creation step to deploy this specific version.